Abstract

We present a novel characterization of how a pro- 
gram stresses cache. This characterization permits 
fast performance prediction in order to simulate 
and assist task scheduling on heterogeneous clus- 
ters. It is based on the estimation of stack distance 
probability distributions. The analysis requires the 
observation of a very small subset of memory ac- 
cesses, and yields a reasonable to very accurate pre- 
diction in constant time. 

1 Introduction 

Heterogeneous resources bring to clusters the
opportunity for workload placement optimizations
[18]. Cache is a core resource. The behavior
of a program relative to cache determines in great 
part its performance on a given server. However, 
cache misses are difficult to predict. In order to en- 
hance schedulers by taking into account cache re- 
sources, programs must be analyzed quickly. The 
program analysis overhead must not exceed the
gain in scheduling efficiency.

This work is a first step towards cache-aware
scheduling in heterogeneous clusters. It consists of
the design and evaluation of a new program
characterization. This characterization permits fast
cache miss prediction. It provides the performance
required for use in schedulers. It is based on the
estimation of stack distance probability distributions.

Related characterizations are presented in
section 2. The scope of this work is defined in
section 3. The design of the new characterization is
explained in section 4 and evaluated in section 5.

2 Related work

Cycle-accurate simulators return a cache event in
response to each instruction. They require a handle
on the application being executed [17] or an
exhaustive trace of the execution [16] [19]. Although
trace compression methods exist, these simulators
are slow compared to other predictors [7].

How well a program behaves relative to cache
has been explained in the literature with the
notion of program locality [13] [11] [2]. Program
locality has a variety of descriptions. Reducing the
description size has always been a challenge for
performance prediction. Programs can be decomposed
into building blocks [9] [22] [23]. The resulting
descriptions are still substantial and they do not
apply to all kinds of caches.

Monte Carlo performance models represent a
program as inter-dependent statistical generators
of stall conditions [8] [20] [21]. These models are
fast. The average number of cache misses in a run
is correct even for complex processors. However,
the cache miss generators used in these works are
still specific to a cache configuration.

Fast cross-platform cache analysis is usually done 
using stack distances. A stack distance is the 
number of different memory addresses accessed be- 
tween two accesses to the same address. Stack dis- 
tances are suited to evaluate fully associative caches 
with Least Recently Used (LRU) replacement pol- 
icy and with cache lines of one element. In these 
cases, and in the absence of pre-fetching, cache 
misses occur for stack distances greater than the 
cache size. In addition, stack distances have been
shown to extend accurately to set-associative caches
with various cache line sizes and replacement
policies.
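As an illustration of the definition above, the following minimal Python sketch computes stack distances from a toy trace and checks them against a simulated fully associative LRU cache with one-element lines. The function names and the trace are hypothetical and are not taken from the tools used in this work. With the "number of different addresses between two accesses" definition, an access hits in a cache of C lines when fewer than C distinct addresses intervene, so the sketch counts a miss when the distance is at least C; the exact threshold depends on whether the reused address itself is counted in the distance.

```python
from collections import OrderedDict

def stack_distances(trace):
    """Stack distance of each access; first-time accesses get infinity."""
    stack = []       # LRU stack, most recently used address last
    distances = []
    for addr in trace:
        if addr in stack:
            pos = stack.index(addr)
            # number of distinct addresses touched since the last access to addr
            distances.append(len(stack) - 1 - pos)
            del stack[pos]
        else:
            distances.append(float("inf"))
        stack.append(addr)
    return distances

def lru_misses(trace, capacity):
    """Misses of a fully associative LRU cache holding `capacity` >= 1 addresses."""
    cache = OrderedDict()
    misses = 0
    for addr in trace:
        if addr in cache:
            cache.move_to_end(addr)        # refresh recency on a hit
        else:
            misses += 1
            if len(cache) >= capacity:
                cache.popitem(last=False)  # evict the least recently used address
            cache[addr] = True
    return misses

trace = ["a", "b", "c", "a", "b", "d", "a"]   # hypothetical toy trace
distances = stack_distances(trace)            # [inf, inf, inf, 2, 2, inf, 2]
```

The list-based stack is quadratic in the worst case; it is chosen here for readability only.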

For prediction, stack distances are usually
recorded in a stack distance histogram. The
precision of a histogram (i.e. the range of its bins)
is usually the size of a cache line. Stack distance
histograms contain the number of cache misses for 
every cache size. Stack distance histograms are 
widely used for cross-platform performance
prediction [10] [4]. They are lighter than application
traces when the cache line size is known. However, 
their size is still substantial and the whole trace 
still needs to be collected. 
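The statement that a histogram contains the number of cache misses for every cache size can be sketched as a single backward pass over the bins. This is a minimal sketch, assuming bins of one line, a fully associative LRU cache, and the convention that an access misses when its stack distance is at least the number of cache lines; the function name is hypothetical.

```python
from collections import Counter

def miss_counts(distances, max_lines):
    """misses[c] = number of accesses that miss in an LRU cache of c lines."""
    histogram = Counter(distances)
    misses = [0] * (max_lines + 1)
    # accesses whose distance exceeds every tracked size miss in all of them
    tail = sum(n for d, n in histogram.items() if d >= max_lines + 1)
    for c in range(max_lines, -1, -1):
        tail += histogram.get(c, 0)   # add the bin that starts missing at size c
        misses[c] = tail
    return misses
```

One pass yields the miss count for all sizes, but the histogram itself still has as many bins as there are distinct distances, which is the size issue discussed above.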

3 Scope of this contribution 

This section explains the limitations and novelty of 
the characterization. 

Limitations. This characterization aims to pre- 
dict the number of cache misses. The cost of a 
cache miss and the impact of pre-fetching are not 
studied here, although they are important to simu- 
late and assist cache-aware scheduling. They must 
be addressed separately. 

Cost of a cache miss. In modern processor ar- 
chitectures the cost of a cache miss on the process 
execution time depends on memory latency and 
bandwidth, the number of hardware threads, the 
quality of branch prediction, other platform char- 
acteristics, and on whether it occurs during direct 
or speculative execution. Evaluating the cost of a 
cache miss is not the concern of this work, which 
focuses on their number. 

Pre-fetching. Modern processors use pre- 
fetching, a strategy that consists of loading data 
to cache before it is required in the program stack. 
Pre-fetching takes advantage of spatial locality. 
Along with efficient branch prediction, pre-fetching 
dramatically reduces the number of cache misses. 
However, pre-fetching is externally scheduled by
processors. It does not belong to the cache
configuration. The evaluation of how well it filters
out cache misses can be done separately, as in [21, 8].

Compulsory cache misses correspond to memory
addresses accessed for the first time, that is, to
infinite stack distances. Compulsory instruction
misses are given by the binary size and compulsory
data misses are given by the data size. The
characterization predicts capacity and conflict misses
according to the standard taxonomy.

Novelty. We propose a new characterization of
how a program stresses cache. This characterization
outperforms current methods in description size,
analysis speed and prediction speed. It offers
constant prediction complexity and the fastest
analysis, since only small subsets of the application
trace need to be extracted. It permits cross-platform
cache performance prediction with reasonable to very
good accuracy.

This level of performance is required to provide
on-the-fly performance prediction in order to
simulate and assist task scheduling on heterogeneous
clusters.

4 The characterization 

We propose a characterization based on the
estimation of the stack distance probability
distribution. Stack distance is seen as a random
variable X. It is fitted to a combination of
well-known probability distributions. The obtained
distribution has a cumulative distribution function
cdf(x) = P(X ≤ x). If the estimation is correct, the
cache miss ratio is P(X > cs/ls) = 1 - cdf(cs/ls),
where cs is the cache size and ls the line size. The
prediction is thus of constant complexity.
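To make the constant-time prediction concrete, here is a minimal Python sketch for the case where the fitted distribution is a single Half Normal with scale sigma. The choice of distribution, the function names and the numeric values are assumptions made for illustration; a combination of distributions would sum several such terms.

```python
import math

def half_normal_cdf(x, sigma):
    """cdf(x) = P(X <= x) for a Half Normal variable with scale sigma."""
    return math.erf(x / (sigma * math.sqrt(2.0))) if x > 0 else 0.0

def predicted_miss_ratio(cache_size, line_size, sigma):
    """Miss ratio 1 - cdf(cs/ls): one closed-form evaluation, no trace needed."""
    return 1.0 - half_normal_cdf(cache_size / line_size, sigma)

# Hypothetical fit (sigma = 500 lines) evaluated for two cache sizes, 64-byte lines.
ratio_32k = predicted_miss_ratio(32 * 1024, 64, 500.0)
ratio_64k = predicted_miss_ratio(64 * 1024, 64, 500.0)
```

The same fitted parameter serves every cache size, which is what makes the characterization cross-platform.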

In addition we propose a method to refine a simple
fit. Cache miss prediction only requires a correct
fit of the upper values of the random variable X.
Indeed, prediction is only useful for realistic cache
sizes. If one determines that no cache is smaller
than a minimal cache size ms, then the distribution
must fit X correctly for values greater than ms/ls,
where ls is the line size.

The refinement algorithm is as follows. X is a
random variable, in practice a list of samples. dist
represents the parameters of a distribution, i.e. the
result of a random variable fit. Function fit is a
regular fit. Function fit' is the refined fit.

function bias(X, dist, ms/ls):
    for each s in X such that s < ms/ls
        do
            s' := randomly generated from dist
        loop until s' < ms/ls
        s := s'
    end for
    return X
end function

function fit'(X, ms/ls):
    dist := fit(X)
    X' := X
    for each refinement
        X' := bias(X', dist, ms/ls)
        dist := fit(X')
    end for
    return dist
end function
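The pseudocode above can be sketched in Python as follows. This is an illustrative sketch under stated assumptions rather than the Java implementation used in this work: the regular fit is a Method of Moments Gamma fit, samples have a positive mean and nonzero variance, the number of refinements is a free parameter, and a retry cap (not part of the pseudocode) keeps the original sample when rejection sampling does not succeed.

```python
import random
import statistics

def fit_gamma(samples):
    """Method of Moments Gamma fit; returns (shape, scale)."""
    mean = statistics.fmean(samples)
    var = statistics.pvariance(samples)
    return mean * mean / var, var / mean

def bias(samples, dist, threshold, rng, max_tries=1000):
    """Replace each sample below `threshold` (ms/ls) by a value drawn from
    `dist` that is also below `threshold`; upper samples are left unchanged."""
    shape, scale = dist
    biased = []
    for s in samples:
        if s < threshold:
            for _ in range(max_tries):   # assumption: bounded rejection sampling
                candidate = rng.gammavariate(shape, scale)
                if candidate < threshold:
                    s = candidate
                    break
        biased.append(s)
    return biased

def refined_fit(samples, threshold, refinements=5, seed=0):
    """fit': alternate between biasing the lower samples and refitting."""
    rng = random.Random(seed)
    dist = fit_gamma(samples)
    biased = list(samples)
    for _ in range(refinements):
        biased = bias(biased, dist, threshold, rng)
        dist = fit_gamma(biased)
    return dist
```

Because replacements are constrained to stay below the threshold, the share of lower samples is preserved while their values progressively conform to the current estimate.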

At each refinement, randomly generated values
based on the previous estimation replace the lower
samples. Suppose that an estimation minimizes the
Mean Squared Error e. e_{i,j} is the error of the
estimation at the ith refinement on the data at the
jth refinement; e^{low}_{i,j} and e^{up}_{i,j} are the
contributions of the lower and upper samples to this
error. e_{n+1,n+1} ≤ e_{n,n+1} because estimation
n + 1 minimizes the error on data n + 1. Since the
lower samples of data n + 1 are generated from
estimation n and the upper samples are left
unchanged, the upper samples are better fitted after
each refinement.

The remainder of this paper is an evaluation of
the characterization based on the analysis of SPEC
CPU2006 benchmarks.



5 Evaluation 

We instrumented SPEC binaries with PIN to obtain
instruction and data load traces [12]. Stack
distances are extracted using the trace profiling
algorithm [4]. We developed a few tools in Java.
These tools include trace analysis, estimators based
on the Method of Moments for a large spectrum of
distributions (Discrete, Uniform, Gamma, Generalized
Pareto (GPD) and Half Normal (HN)), random
number generators for each of these distributions,
and estimation refinement¹.
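For readers unfamiliar with the Method of Moments, the sketch below shows what such estimators and generators could look like for two of the listed distributions. It is written in Python for brevity and is not the Java code of the tools; all function names are hypothetical. The estimators equate the sample mean and variance with their theoretical expressions: a Uniform on [a, b] has variance (b - a)^2 / 12, and a Half Normal with scale sigma has mean sigma * sqrt(2 / pi).

```python
import math
import random
import statistics

def fit_uniform(samples):
    """Return (low, high) of the Uniform matching the sample mean and variance."""
    mean = statistics.fmean(samples)
    half_width = math.sqrt(3.0 * statistics.pvariance(samples))
    return mean - half_width, mean + half_width

def fit_half_normal(samples):
    """Return the scale sigma of the Half Normal matching the sample mean."""
    return statistics.fmean(samples) * math.sqrt(math.pi / 2.0)

def draw_uniform(rng, low, high):
    """Random generator for the Uniform distribution."""
    return rng.uniform(low, high)

def draw_half_normal(rng, sigma):
    """Random generator for the Half Normal: absolute value of a centered Gaussian."""
    return abs(rng.gauss(0.0, sigma))
```

Pairing each estimator with a generator is what allows both the Monte Carlo simulations and the refinement step to reuse a fitted distribution.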

The evaluation is presented in three steps. The
first step shows how well stack distances fit a
probability distribution. The second step shows the
effect of collecting a limited number of stack distance
samples. The third step is a discussion on using the
analysis to predict cache misses of other parts of the
program and with other input data.

¹ All tools and data developed and collected for this
work are available at http://code.google.com/p/mtc-project

5.1 Stack distance distribution fit 

Figure 1: Stack distances visualization.

Figure 1 shows stack distances on a representative
cross-section of SPEC CPU2006 benchmarks.
Left-hand-side figures show stack distances between
data accesses, and right-hand-side figures show
stack distances between instruction accesses. Light
dots are stack distances in chronological order.
Dashed lines are the outlines. They are made with
the same values in descending order. The plots of
figure 1 differ from histograms. On a histogram,
values are on the x axis and the y axis measures
the number of occurrences. On figure 1 the x axis
is a list of memory accesses and the y axis measures
the corresponding stack distances.

In general, outlines are composed of curves and
straight segments. An outline exclusively composed
of straight segments indicates that the variable
perfectly fits a discrete distribution. In this
case the characterization is equivalent to estimating
the histogram. It results in a compressed histogram
where empty bins are removed. By contrast, a curve
indicates that a histogram would require a high
number of bins. When curves exist, fitting a
continuous distribution dramatically reduces the
characterization size, since a continuous distribution
is typically determined by two or three parameters.
Among the 28 SPEC CPU2006 benchmarks, 11 have
discrete instruction stack distances, and three
(gromacs, lbm and libquantum) have discrete data
stack distances. In general, a stack distance
distribution is the sum of a discrete distribution
and continuous distributions.
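The decomposition into a discrete part and a continuous part can be sketched as follows. The splitting rule (a value is treated as discrete when it accounts for at least a given share of the samples), the 5% default, and the use of a single Half Normal for the continuous part are assumptions made for this illustration; they are not the criteria used by the tools.

```python
import math
import statistics
from collections import Counter

def split_discrete(samples, min_share=0.05):
    """Return (discrete, continuous): `discrete` maps frequent values to their
    probability (a compressed histogram), `continuous` keeps the other samples."""
    counts = Counter(samples)
    total = len(samples)
    discrete = {v: c / total for v, c in counts.items() if c / total >= min_share}
    continuous = [s for s in samples if s not in discrete]
    return discrete, continuous

def mixture_miss_ratio(discrete, continuous, lines):
    """P(X > lines) for the sum of the discrete part and a Half Normal fit."""
    tail = sum(p for v, p in discrete.items() if v > lines)
    if continuous:
        sigma = statistics.fmean(continuous) * math.sqrt(math.pi / 2.0)
        weight = 1.0 - sum(discrete.values())   # probability mass of the continuous part
        tail += weight * math.erfc(lines / (sigma * math.sqrt(2.0)))
    return tail
```

When every sample is discrete, the continuous list is empty and the prediction reduces to a lookup in the compressed histogram.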



Figure 2: A problematic fit (plot title: data stack
distance histogram, GemsFDTD, 10000 samples).

Figure 2 illustrates the analysis of GemsFDTD.
The outline is shown along with Monte Carlo
simulations based on different analyses. For analysis,
discrete parts are filtered out and fitted separately.
The remaining samples are fitted to a continuous
distribution. HN fits the upper part of the curve
well and GPD the lower part. However, the whole
curve does not fit any single distribution
alone. Gamma and Uniform average the trends.
The characterization does not accurately account
for all stack distances in a program whose outline
has an inflexion point. Eight data traces out of
the 28 benchmarks fall into this category. In these
worst cases, the refined fit makes it possible to
concentrate on the higher stack distances that
account for cache misses.

Figure 3: Fit quality in various scenarios.

Figure 3 illustrates six representative analysis
scenarios. For each benchmark the best distribution
is selected, and the fit is refined. The selection can
be done automatically by picking the distribution
that accounts for the smallest estimation error. For
the first three benchmarks, the estimation is quite
accurate. Refined estimations give the best results.
Indeed, outlines have long tails that bias the
estimation of upper values in the absence of refinement.
The precision on the fourth benchmark is impaired
by the precision of the discrete fit. The last two
benchmarks are the worst cases, one because of its
heavy tail and the other because of its irregular
outline.
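The automatic selection mentioned above can be sketched as a loop over candidate distributions. In this minimal sketch the candidates are limited to Uniform and Half Normal, and the estimation error is taken to be the mean squared difference between the fitted cdf and the empirical cdf at the sample points; both choices, like the function names, are assumptions made for illustration. Samples are assumed to have nonzero variance.

```python
import math
import statistics

def fit_uniform(samples):
    mean = statistics.fmean(samples)
    half_width = math.sqrt(3.0 * statistics.pvariance(samples))
    return mean - half_width, mean + half_width

def uniform_cdf(x, params):
    low, high = params
    return min(1.0, max(0.0, (x - low) / (high - low)))

def fit_half_normal(samples):
    return statistics.fmean(samples) * math.sqrt(math.pi / 2.0)

def half_normal_cdf(x, sigma):
    return math.erf(x / (sigma * math.sqrt(2.0))) if x > 0 else 0.0

# candidate name -> (Method of Moments estimator, cdf taking the fitted parameters)
CANDIDATES = {
    "uniform": (fit_uniform, uniform_cdf),
    "half_normal": (fit_half_normal, half_normal_cdf),
}

def cdf_mse(samples, cdf, params):
    """Mean squared difference between the fitted and the empirical cdf."""
    ordered = sorted(samples)
    n = len(ordered)
    return sum((cdf(x, params) - (i + 1) / n) ** 2
               for i, x in enumerate(ordered)) / n

def select_distribution(samples):
    """Return (name, params, error) of the candidate with the smallest error."""
    best = None
    for name, (fit, cdf) in CANDIDATES.items():
        params = fit(samples)
        error = cdf_mse(samples, cdf, params)
        if best is None or error < best[2]:
            best = (name, params, error)
    return best
```

Restricting the error to sample points above ms/ls would align the selection with the refined fit, at the cost of ignoring the lower part of the outline.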

In conclusion, the characterization is accurate for
most SPEC CPU2006 benchmarks: stack distances
predicted in the Monte Carlo simulations closely
match actual values. However, there are difficult
scenarios where the prediction is only accurate to
within an order of magnitude.

5.2 Analysis speed and prediction accuracy



In section 5.1 we considered the ability of a
probability distribution to accurately reproduce stack
distances and thus predict cache misses for any kind
of cache. The difference between actual stack
distances and the best distribution is a first
contribution to the prediction error. In this section
we evaluate the number of samples required to fit such
a distribution. The limited number of stack distance
samples introduces another contribution to
the prediction error. There must be just enough
samples to obtain an estimation as close to the best
estimation as the best estimation is to the real data.
In the following, this number of samples is called
adequate.

Figure 4: Data cache misses prediction: soplex.

Although refined fits are better in general, they
are not better for all cache sizes. For soplex data 
misses prediction with refined fits, the adequate 
number of samples is around 2^. One sample must 
be collected every 50,000 data accesses in memory. 
This number yields a prediction accuracy of 99%. 
For dealII instruction miss prediction, the adequate
number of samples is around 2^^. One sample
must be collected every 13,000 instructions, for
an accuracy of 99.6%.

In conclusion, accurate predictions are obtained 
with fast analysis. Less accurate predictions can be 
done faster. 

5.3 Prediction robustness 

In this section we briefly discuss the accuracy of
the characterization in predicting cache misses in
the future and with different input data.

Figure 5: Inst. cache misses prediction: dealII.

In figures 4 and 5, two benchmarks are examined.
Actual cache miss ratios with two different
caches are compared to predictions based on different
sample sets. The same characterization is used
to predict cache misses for the two cache sizes.



Figure 6: Two sample sets of bzip2. 

Figure 6 illustrates the variation of the bzip2 stack
distance outline. Samples are taken from millions
of consecutive memory accesses, and the two sample
sets are separated by a few seconds. The samples
are represented in chronological order on separate
figures (top). The outlines are represented on
the same picture (bottom). With bzip2 the outlines
are roughly similar, but a precise prediction is not
possible for future cache misses.

Figure 7: Different inputs and observation segments.

Figure 7 shows the evolution of the outline for
four benchmarks. sjeng does not change its memory
access pattern over time; gobmk does not change
with different input data. By contrast, the
instruction access pattern of astar changes over time,
as does the data access pattern of wrf.

In conclusion, future behaviors can be predicted 
only if the program is known to follow a certain reg- 
ularity. For example, scientific computations often 
involve the repetitive execution of the same
routines [22]. In some cases, as with gobmk, different
input data do not change memory access patterns. 
The analysis, unnoticeable on the first run of a rou- 
tine or on the first input data, provides at worst a 
rough indication of the cache misses ratio, useful for 
cache-aware scheduling. In other cases it predicts 
future cache misses with very high precision. 

6 Conclusion 

We presented a novel characterization of how a
program stresses cache, in terms of the stack distance
fit to a probability distribution. The characterization
has a very small size and provides cache miss
predictions in constant time. Its evaluation
distinguishes three contributions to the prediction
error. The first is relative to the appropriateness of
a probability distribution to describe stack distances.
The second is relative to the number of samples used
for the fit, and the third is relative to the changes
in program behavior. Even the worst cases yield
reasonable accuracy for simulating or assisting
scheduling systems. Many application behaviors are
very accurately described by probability distributions
and have enough regularity for the prediction to apply
under different circumstances. Fitting a distribution
requires the extraction of a very small subset of
the trace. This makes the analysis extremely fast,
which is needed to simulate and assist scheduling
systems.